R is a great resource with extremely good documentation, including several free books that systematically teach users the basics and then some. This tutorial is not meant to replace any of those resources but to give you a chance to practice your skills in R and RStudio by building bar charts. If you are interested in learning more R, there is a recommended book list at the end.
It is recommended that you have used R and the Tidyverse before. If you are new to R or the Tidyverse, we’d recommend starting with the UFOs Tutorial.
This tutorial is meant to test your skills and knowledge of R and the Tidyverse. While the code for each graph is available, it will be hidden to begin with. Try to recreate each graph using just the information given on it before looking at the original code.
If you have already set up an RStudio account or are using RStudio, you can skip to loading your libraries.
To complete this tutorial, you will need an account with RStudio Cloud. Navigate to their homepage and click Get Started.
Once you’ve made your account, you will be taken to Your Workspace. Click the blue New Project button to the right of Your Projects and wait for the Deploying Project bubbles to disappear.
Under File, highlight New File and select R Script.
Go ahead and save the file. This is where we will write and run all our commands.
Open Tools and click on Global Options. There will be four sections: R Sessions, Workspace, History and Other. Under Workspace, uncheck Restore .RData into workspace at startup and change Save workspace to .RData on exit to Never. Hit OK to save and exit.
Changing these options makes it a little harder to pick up where you left off after you close a project, but it makes your code reproducible: every session starts with an empty workspace, so your script has to create everything it uses and will give the same results every time you run it.
All of the functions we will be using are well documented. If you ever have questions about one, run it in the console (lower left pane) with a ? in front, e.g. ?help()
This tutorial uses the tidyverse, ggstance and ggthemes packages. If you don’t have a package installed, you can install it with a console command (e.g. install.packages("tidyverse")).
library(tidyverse)
## -- Attaching packages ----------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts -------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggstance)
##
## Attaching package: 'ggstance'
## The following objects are masked from 'package:ggplot2':
##
## geom_errorbarh, GeomErrorbarh
library(ggthemes)
You only need to install a package once. After that, you can load it whenever you need it with a library() call.
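If you are not sure which of the three packages are already installed, a check along these lines may help. It is only a sketch: missing_packages() is a helper name made up for this example, not part of any package, and the install line is left commented out so nothing is installed by accident.

```r
# Packages this tutorial loads with library()
needed <- c("tidyverse", "ggstance", "ggthemes")

# Hypothetical helper: return the packages that are not installed yet.
# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(needed)
to_install

# Uncomment to install whatever is missing (needs an internet connection)
# if (length(to_install) > 0) install.packages(to_install)
```

An empty result (character(0)) means all three packages are already available and the library() calls should work.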
We will be working with a semi-cleaned version of a UFO data set from Kaggle. The original data set can be found here. We will only be looking at data for the US. You can get the zipped file here: ufos.zip. RStudio Cloud only takes data in a zipped format, so we will upload the data set into our environment as a zip file.
Under the Files tab in the lower right pane, click the Upload button. Find the zipped file called ufos on your computer and upload it. You should see ufos.csv appear in your files.
Read in the csv as a tibble, a slightly more flexible version of a data frame, and store it in a variable that we will use for the rest of the lab. read_csv() already returns a tibble, so no extra conversion is needed.
ufos <- read_csv("ufos.csv")
You will get a printout of the column specification that read_csv() guessed for each column.
You can ignore it.
You should see ufos show up as a data set in your Global Environment in the upper right pane of your screen. If you click on it, the data will open in a new tab.
Let’s take a quick look at our data using glimpse(), which will give us a compressed view of the data. Take note of the categorical and numerical types of entries; we will be using most of these later.
glimpse(ufos)
## Observations: 65,114
## Variables: 14
## $ datetime <dttm> 1949-10-10 20:30:00, 1956-10-10 21:00:00...
## $ city <chr> "san marcos", "edna", "kaneohe", "bristol...
## $ state <chr> "tx", "tx", "hi", "tn", "ct", "al", "fl",...
## $ country <chr> "us", "us", "us", "us", "us", "us", "us",...
## $ shape <chr> "cylinder", "circle", "light", "sphere", ...
## $ duration..seconds. <dbl> 2700, 20, 900, 300, 1200, 180, 120, 300, ...
## $ duration..hours.min. <chr> "45 minutes", "1/2 hour", "15 minutes", "...
## $ comments <chr> "This event took place in early fall arou...
## $ date.posted <date> 2004-04-27, 2004-01-17, 2004-01-22, 2007...
## $ latitude <dbl> 29.88306, 28.97833, 21.41806, 36.59500, 4...
## $ longitude <dbl> -97.94111, -96.64583, -157.80361, -82.188...
## $ year <dbl> 1949, 1956, 1960, 1961, 1965, 1966, 1966,...
## $ month <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1...
## $ day <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1...
Being able to work with time is an important skill when it comes to data. It also tends to work well with bar charts. The plots below will start with some basic graphs over time and move into specific filtering. These plots will be using some styling elements, and each will have a theme.
ufos %>%
  ggplot(aes(year)) +
  geom_bar(fill = "steelblue") +
  theme_clean() +
  labs(
    subtitle = "Observations by year",
    caption = "Theme: theme_clean"
  )
Hint: Look at the names of the variables on the x and y axis. How can we get our bar chart to have those names?
Want to add labels to your graphs? Use labs() with title, subtitle, caption, x and y.
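As a rough illustration of labs(), the sketch below labels a bar chart built from a tiny made-up data frame (toy stands in for ufos, and its values are invented). It shows where each label goes rather than how any graph in this tutorial was made.

```r
library(ggplot2)

# Toy data standing in for ufos (values are made up for illustration)
toy <- data.frame(year = c(1999, 1999, 2000, 2001, 2001, 2001))

# geom_bar() counts the rows for each year; labs() sets the text around the plot
p <- ggplot(toy, aes(year)) +
  geom_bar(fill = "steelblue") +
  labs(
    title = "Toy sightings",          # large heading above the plot
    subtitle = "Observations by year", # smaller line under the title
    caption = "Made-up data",          # small note below the plot
    x = "Year",                        # x axis name
    y = "Number of sightings"          # y axis name
  )

# Type p in the console to draw the chart
```

Any label you leave out of labs() keeps its default, so you can set only the ones you need.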
ufos %>%
  filter(
    year >= 1975
  ) %>%
  ggplot(aes(year)) +
  geom_bar(fill = "steelblue") +
  theme_clean() +
  labs(
    subtitle = "Observations since 1975",
    caption = "Theme: theme_clean"
  )
Hint: year is currently being read as numerical. How can we use that to our advantage when we filter it?
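A small sketch of that idea, using a made-up tibble named toy rather than the real ufos data: because year is numeric, comparison operators such as >= work directly inside filter(), and conditions separated by commas must all be true for a row to be kept.

```r
library(dplyr)

# Made-up rows for illustration only
toy <- tibble(
  year  = c(1970, 1975, 1980, 2001),
  state = c("co", "tx", "co", "co")
)

# Keep rows from 1975 onwards (1975 itself is included)
recent <- toy %>%
  filter(year >= 1975)

# Comma-separated conditions are combined with "and"
recent_co <- toy %>%
  filter(year >= 1975, state == "co")

recent
recent_co
```

If year were stored as text instead, you would have to convert it before comparing it with a number.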
ufos %>%
  filter(
    between(year, 1975, 2000)
  ) %>%
  ggplot(aes(year)) +
  geom_bar(fill = "steelblue") +
  theme_clean() +
  labs(
    subtitle = "Observations between 1975 and 2000",
    caption = "Theme: theme_clean"
  )
Hint: ?between()
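For reference, between(x, left, right) keeps values from left to right with both ends included. The sketch below checks that on a made-up vector of years (the vector and variable names are invented for this example) and compares it with the longer form written with >= and <=.

```r
library(dplyr)

# Made-up years for illustration
years <- c(1974, 1975, 1990, 2000, 2001)

# TRUE for every value from 1975 to 2000, boundaries included
in_range <- between(years, 1975, 2000)
in_range

# The same test written out by hand
by_hand <- years >= 1975 & years <= 2000
identical(in_range, by_hand)
```

Inside filter(), between(year, 1975, 2000) therefore keeps sightings from 1975 and from 2000 as well as everything in between.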
ufos %>%
  filter(
    year >= 1975,
    state == "co"
  ) %>%
  ggplot(aes(year)) +
  geom_bar(fill = "steelblue") +
  theme_clean() +
  labs(
    subtitle = "Observations since 1975 for Colorado (co)",
    caption = "Theme: theme_clean"
  )
ufos %>%
  ggplot(aes(as.factor(month))) +
  geom_bar(fill = "steelblue") +
  theme_clean() +
  labs(
    subtitle = "Observations by month",
    caption = "Theme: theme_clean"
  )
ufos %>%
  filter(
    between(year, 2000, 2003)
  ) %>%
  ggplot(aes(as.factor(month))) +
  geom_bar(aes(fill = as.factor(year)), position = "dodge") +
  theme_fivethirtyeight() +
  labs(
    subtitle = "Observations by Month and year between 2000 and 2003",
    caption = "Theme: theme_fivethirtyeight\nFill: scale_fill_fivethirtyeight()"
  ) +
  scale_fill_fivethirtyeight()
Another common plot you will come across, which we tried out in the last plot, is two variables colored by a third variable. The following plots will do a lot of counting and filtering. They will also use geom_barh() from the ggstance package.
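The counting step can be tried on its own first. In this sketch, toy is a made-up tibble standing in for ufos. count() returns one row per state with the tally in a column named n, which is why the plots in this section filter on n and draw their bars with stat = "identity": the lengths are already computed, so nothing is left for the geom to count.

```r
library(dplyr)

# Made-up sightings, one row per observation
toy <- tibble(state = c("co", "co", "tx", "co", "tx", "hi"))

# One row per state, with the number of rows in column n
by_state <- toy %>%
  count(state)
by_state

# Filter on the counts, just like filter(n > 1000) on the real data
big <- by_state %>%
  filter(n > 1)
big
```

The threshold of 1 here is only chosen to suit the toy data.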
ufos %>%
  count(state) %>%
  filter(n > 1000) %>%
  ggplot(aes(n, reorder(state, n))) +
  geom_barh(
    stat = "identity",
    fill = "indianred4",
    color = "black"
  ) +
  labs(
    subtitle = "States with more than 1K UFO sightings",
    caption = "Theme: theme_gdocs"
  ) +
  theme_gdocs()
ufos %>%
  count(state) %>%
  filter(n < 500) %>%
  ggplot(aes(n, reorder(state, n))) +
  geom_barh(
    stat = "identity",
    fill = "indianred4",
    color = "black"
  ) +
  labs(
    subtitle = "States with less than 500 UFO sightings",
    caption = "Theme: theme_gdocs"
  ) +
  theme_gdocs()
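If your installed ggplot2 is version 3.3.0 or newer (the session shown earlier uses 3.2.1, so this is an assumption about your setup), horizontal bars can likely be drawn without ggstance. geom_col() uses precomputed values in the same way as stat = "identity", and it draws horizontally when the numeric variable is mapped to x and the category to y. A sketch with made-up counts:

```r
library(ggplot2)

# Made-up counts, shaped like the output of count(state)
toy_counts <- data.frame(
  state = c("co", "hi", "tx"),
  n     = c(3, 1, 2)
)

# Numeric variable on x, category on y: bars run left to right
p_h <- ggplot(toy_counts, aes(n, reorder(state, n))) +
  geom_col(fill = "indianred4", color = "black")

# Type p_h in the console to draw the chart
```

On older ggplot2 versions, geom_barh() from ggstance remains the way to get this layout.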
For the next four graphs, we want to focus on the shapes of UFOs. However, since there are a lot of shapes, we will only look at the top 5.
ufos %>% count(shape) %>% arrange(desc(n))
## # A tibble: 29 x 2
## shape n
## <chr> <int>
## 1 light 13473
## 2 triangle 6549
## 3 circle 6118
## 4 fireball 5148
## 5 unknown 4567
## 6 other 4466
## 7 sphere 4347
## 8 disk 4121
## 9 oval 3032
## 10 formation 1990
## # ... with 19 more rows
To make the data set, we filter so we only keep observations whose shape is light, triangle, circle, fireball or sphere, the five most common named shapes (unknown and other are skipped). We also add a column, n, that contains the number of these observations for each state.
ufos_shapes <- ufos %>%
  filter(
    shape == "light" |
      shape == "triangle" |
      shape == "circle" |
      shape == "fireball" |
      shape == "sphere"
  ) %>%
  group_by(state) %>%
  mutate(n = n()) %>%
  ungroup()
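The chain of == comparisons joined by | can also be written with %in%, which some readers find easier to extend. The sketch below is one such alternative, run on a made-up tibble (toy and top_shapes are names invented here). It also shows that mutate(n = n()) adds a column, repeated on every row of a state, and leaves the number of rows unchanged.

```r
library(dplyr)

# Made-up sightings for illustration
toy <- tibble(
  state = c("co", "co", "tx", "tx", "tx", "hi"),
  shape = c("light", "disk", "circle", "light", "sphere", "oval")
)

# The shapes to keep, stored once so the list is easy to change
top_shapes <- c("light", "triangle", "circle", "fireball", "sphere")

toy_shapes <- toy %>%
  filter(shape %in% top_shapes) %>%  # TRUE when shape matches any entry
  group_by(state) %>%
  mutate(n = n()) %>%                # rows per state, after filtering
  ungroup()

toy_shapes
```

Note that n counts only the rows that survived the filter, so it is the number of sightings with one of the chosen shapes, not all sightings in the state.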
ufos_shapes %>%
  filter(n > 1000) %>%
  ggplot(aes(reorder(state, n))) +
  geom_bar(aes(fill = shape)) +
  labs(
    subtitle = "States with more than 1K UFO sightings",
    caption = "Theme: theme_gdocs"
  ) +
  theme_gdocs() +
  scale_fill_gdocs() +
  coord_flip()
ufos_shapes %>%
  filter(n < 500) %>%
  ggplot(aes(reorder(state, n))) +
  geom_bar(aes(fill = shape)) +
  labs(
    subtitle = "States with less than 500 UFO sightings",
    caption = "Theme: theme_gdocs"
  ) +
  theme_gdocs() +
  scale_fill_gdocs() +
  coord_flip()
ufos_shapes %>%
  filter(
    shape == "light" |
      shape == "triangle" |
      shape == "circle"
  ) %>%
  filter(n < 500) %>%
  ggplot(aes(reorder(state, n))) +
  geom_bar(aes(fill = shape)) +
  labs(
    subtitle = "States with less than 500 UFO sightings",
    caption = "Theme: theme_tufte",
    x = "State",
    y = "Observations"
  ) +
  theme_tufte() +
  theme(legend.position = "none") +
  facet_wrap(~shape) +
  scale_fill_brewer(type = "qual") +
  coord_flip()
Hint: ?facet_wrap(). To remove the legend, use theme(legend.position = "none"); see https://www.datanovia.com/en/blog/how-to-remove-legend-from-a-ggplot/#ggplot-with-no-legend
Another type of graph you will want to make uses calculated values. The best way to compute them is with summarise():
ufos_shapes %>%
  group_by(shape) %>%
  summarise(average_min = mean(duration..seconds. / 60)) %>%
  ungroup() %>%
  ggplot(aes(shape, average_min)) +
  geom_bar(
    fill = "#55752f",
    color = "grey9",
    stat = "identity"
  ) +
  theme_pander() +
  labs(
    title = "Average minutes per shape",
    # paste() with sep = "\n" builds the same four-line caption
    caption = paste(
      "Calculated with duration..seconds. using ufos_shapes",
      "theme: theme_pander",
      "fill: #55752f",
      "color: grey9",
      sep = "\n"
    )
  )
Hint: Use summarise and mean
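As a minimal sketch of that pattern on made-up data (toy and its seconds column are invented for illustration): group_by() splits the rows by shape, and summarise() collapses each group to a single row holding the mean duration in minutes.

```r
library(dplyr)

# Made-up durations in seconds
toy <- tibble(
  shape   = c("light", "light", "circle", "circle"),
  seconds = c(60, 180, 600, 1200)
)

# One row per shape, with the average duration converted to minutes
avg <- toy %>%
  group_by(shape) %>%
  summarise(average_min = mean(seconds / 60))

avg
```

If a duration column contained missing values, mean() would return NA for that group unless na.rm = TRUE were added.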
Congratulations on finishing the tutorial! Now see if you can think up and answer some of your own questions!
Did you love R? You can download it onto your computer for unlimited and offline use!
Delve a little deeper into R with one of these books: